Erich Schubert: Definition of Data Science
Everything is "big data" now, and everything is "data science", because
these terms lack a proper, falsifiable definition.
A number of attempts to define them exist, but they usually only consist
of a number of "nice to haves" strung together. For Big Data, it's the 3+ V's,
and for Data Science, this diagram on Wikipedia is a typical example.
This is not surprising: effectively, these terms are all marketing,
not scientific attempts at defining a research domain.
Actually, my favorite definition is this, except that it should maybe read
"pink pony" in the middle, instead of "unicorn".
Data science has been called "the sexiest job" so often that this has recently led to an integer overflow.
The problem with these definitions is that they are open-ended. They name
some examples (like "volume") but they essentially leave it open to call
anything "big data" or "data science" that you would like to. This is, of course,
a marketer's dream buzzword. There is nothing saying that "picking my nose"
is not big data science.
If we ever want to get to a usable definition and get rid of all the hype,
we should consider a more precise definition, even when this means making
it more exclusive (funnily enough, some people have already called the open-ended
definitions above "elitist" ...).
Big data:
- Must involve distributed computation on multiple servers
- Must intermix computation and data management
- Must advance over the state of the art in relational databases, data warehousing and cloud computing as of 2005
- Must enable results that were unavailable with earlier approaches, or that would take substantially longer (runtime or latency)
- Must be disruptively more data-driven
Data science:
- Must incorporate domain knowledge (e.g. business, geology, etc.).
- Must take computational aspects into account (scalability etc.).
- Must involve scientific techniques such as hypothesis testing and result validation.
- Results must be falsifiable.
- Should involve more mathematics and statistics than earlier approaches.
- Should involve more data management than earlier approaches (indexing, sketching and hashing etc.).
- Should involve machine learning, AI or knowledge discovery algorithms.
- Should involve visualization and rapid prototyping for software development.
- Must satisfy at least one of these shoulds at a disruptive level.
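The data science list above can be read as a simple predicate: all "musts" have to hold, and at least one "should" has to hold at a disruptive level. The following is only a minimal sketch of that reading, not a formal definition; the criterion names, the `Project` class and the `is_data_science` function are hypothetical and merely paraphrase the bullet points. Deciding whether something really is "disruptive" remains a judgment call that no code can make for you.

```python
from dataclasses import dataclass, field

# Hypothetical short names paraphrasing the "must" bullets above.
MUSTS = (
    "domain_knowledge",
    "computational_aspects",
    "scientific_techniques",
    "falsifiable_results",
)

# Hypothetical short names paraphrasing the "should" bullets above.
SHOULDS = (
    "mathematics_statistics",
    "data_management",
    "machine_learning",
    "visualization_prototyping",
)


@dataclass
class Project:
    # Which "musts" the project satisfies.
    musts: set = field(default_factory=set)
    # Which "shoulds" the project satisfies at a disruptive level
    # (assumed to be assessed by a human, not computed).
    disruptive_shoulds: set = field(default_factory=set)


def is_data_science(project: Project) -> bool:
    """All musts hold, and at least one should holds disruptively."""
    all_musts = all(m in project.musts for m in MUSTS)
    any_disruptive = any(s in project.disruptive_shoulds for s in SHOULDS)
    return all_musts and any_disruptive
```

For example, a project that meets every "must" but is not disruptive in any of the "shoulds" would not qualify under this reading.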
But this is all far from a proper definition, partially because
these fields are so much in flux, but largely because they're just too ill-defined.
There is a lot of overlap that we should try to flesh out. For example,
data science is not just statistics, because it is much more concerned with how
data is organized and how the computations can be made efficient. Yet often,
statistics is much better at integrating domain knowledge. People coming from
computation, on the other hand, usually care too little about the domain knowledge
and the falsifiability of their results - they're happy if they can compute anything.
Last but not least, nobody will be in favor of such a rigid definition and
such requirements, because most likely you will have to strip that "data scientist"
label off your business card - and why bite the hand that feeds?
Most of what I do certainly would not qualify as data science or big
data anymore with an "elitist" definition. While this doesn't lessen my
scientific results, it makes them less marketable.
Essentially, this is like a global "gentlemen's agreement": buzz these words
while they last, then move on to the next similar "trending topic".
Maybe we should just leave these terms to the marketing folks, and let
them inflate the bubble until it bursts. Instead, we should just stick to the established
and better defined terms...
- When you are doing statistics, call it statistics.
- When you are doing unsupervised learning, call it machine learning.
- When your focus is distributed computation, call it distributed computing.
- When you do data management, continue to call it data management and databases.
- When you do data indexing, call it data indexing.
- When you are doing unsupervised data mining, call it cluster analysis, outlier detection, ...
- Whatever it is, try to use a precise term, instead of a buzzword.
Thank you.
Of course, sometimes you will have to play Buzzword Bingo.
Nobody is going to stop you. But I will assume that you are just "playing
buzzword bingo" unless you get more precise.
Once you have results that are massively better and have really
disrupted science, you can still call it "data science" later on.
As you have seen, I've been picking on the word "disruptive" a lot. As
long as you are doing "business as usual" and focusing on off-the-shelf
solutions, it will not be disruptive. And it then won't be big data science, or
a big data approach that yields major gains. It will be just "business as
usual" with different labels, and it will return results as usual.
Let's face it. We don't just want big data or data science. What everybody
is looking for is disruptive results, and those will require a radical
approach, not a slight modification of what you have been doing all along
that merely involves slightly more computers.